I am trying to present what issues are present in how p-values are currently used and which novel methods exist to potentially alleviate these issues.
I am not trying to advocate for Bayesian, Frequentist, or other methods.
I am simply looking into what may or may not be feasible ways to help researchers have more rigorous, reproducible and transparent results.
Let’s start simple and look at the general process used commonly:
We set up our hypotheses:
\(H_0: \mu_1=\mu_2\) \(H_1: \mu_1\ne \mu_2\)
Collect our data: \(\bar x_1 - \bar x_2\)
We put the data on a relative scale via a “test statistic”, e.g., \(t_{obs}\)
And compute the “\(p\)-value”
\[p = P(|T| \ge |t_{obs}| \mid H_0)\]
“the probability of observing a test statistic at least as extreme as what was observed if \(H_0\) is true and underlying model assumptions hold”
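As a concrete illustration of this pipeline, here is a minimal sketch for a two-sample comparison. The function name is mine, and a large-sample normal approximation stands in for the exact \(t\) reference distribution:

```python
from statistics import NormalDist, mean, stdev

def two_sample_p_value(x1, x2):
    """Welch-style two-sample comparison, using a large-sample normal
    approximation to the reference distribution of the test statistic."""
    n1, n2 = len(x1), len(x2)
    se = (stdev(x1) ** 2 / n1 + stdev(x2) ** 2 / n2) ** 0.5
    t_obs = (mean(x1) - mean(x2)) / se  # standardized difference in means
    # Two-sided p-value: P(|T| >= |t_obs|) under H0
    p = 2 * (1 - NormalDist().cdf(abs(t_obs)))
    return t_obs, p

# Hypothetical measurements for two groups
x1 = [5.1, 4.9, 5.4, 5.0, 5.2, 4.8]
x2 = [5.0, 5.1, 4.9, 5.2, 5.0, 4.9]
t_obs, p = two_sample_p_value(x1, x2)
```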
Very often, introductory statistics courses teach:
“Reject \(H_0\) when \(p < \alpha\)”.
What alpha do we use?
\(p<0.05 \to\) “reject \(H_0\)” or discovery of an “effect”
(See Cabras & Castellanos (2017) for a generalization of this bound.)
Keep in mind this is presented under the assumption that neither hypothesis is favored prior to the study.
The American Statistician published a statement from the ASA board of directors on \(p\)-values.
No specifics were given, just “Do something else”.
In 2019, The American Statistician published a supplementary issue.
There were 43 (!) articles on the topic, and an editorial paper titled ‘Moving to a World Beyond “\(p < 0.05\)”’.
The articles fell into four major categories of ways to “move beyond \(p < 0.05\)”.
Their order below potentially reflects how practice may actually progress chronologically.
The simplest method(s) involve a few simple things.
Report your exact \(p\)-value to at least 3, ideally 4, significant digits.
Do not declare you have a “novel” discovery unless \(p \le 0.005\), i.e., \(\alpha = 0.005\).
| \(p>0.05\) | \(0.005 < p \le 0.05\) | \(p\le 0.005\) |
|---|---|---|
| “No reasonably detectable effect” | “Suggestive” | “Discovery” |
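The three-way rule above is easy to mechanize. A minimal sketch (the function name is mine; the labels follow the table):

```python
def label_p_value(p):
    """Map a p-value to the three-way labels in the table above."""
    if p <= 0.005:
        return "Discovery"
    if p <= 0.05:
        return "Suggestive"
    return "No reasonably detectable effect"
```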
Or by extension, treat \(p\)-values as a spectrum.
| \(p>0.1\) | \(p \approx 0.05\) | \(p \approx 0.01\) | \(p \approx 0.001\) | \(p \le 0.0001\) |
|---|---|---|---|---|
| Nothing notable | Suggestive | Meaningful | Strong | Very Strong |
This stems from the paper “Redefine Statistical Significance” (Benjamin et al. 2018).
Arguments for this change:
Arguments against from “Justify Your Alpha”:
This method focuses on incorporating the idea of the Bayes Factor alongside the \(p\)-value. (An olive branch to both parties?)
\[BF = \frac{\text{average}^* \text{ likelihood of data under } H_1}{\text{likelihood of data under } H_0}\]
Berger et al. 2001
\[BF \le BFB = \frac{1}{-e\, p \log(p)}, \qquad p < 1/e\]
This provides a simple calculation for an approximate upper bound on the BF. (Or a lower bound when the BF is defined with the hypotheses reversed.)
In other words, this is an approximate upper bound on a Bayesian measure of evidence in favor of the alternative.
This is the basis of the False Positive Rate bound discussed earlier.
Berger et al. show this holds under “general” conditions.
Recall this bound is under the assumption \(H_0\) and \(H_1\) are equally likely.
Benjamin & Berger (2019) suggest the following:
Do not use the word “significant” with \(p\)-values, i.e., don’t use a hard cutoff. Use words like “suggestive” or “strong” evidence (\(p\)-values as a spectrum).
Report the \(p\)-value along with the BFB. If \(p = 0.05\), then report BFB \(= 2.44\) alongside it, i.e., <results statement> (\(p = 0.05\), BFB \(= 2.44\)).
Additionally, report the fact that the BFB is calculated based on prior odds of \(1/2\), and report the corresponding upper bound on the posterior odds.
\[\text{Posterior odds bound} = BFB\cdot\frac{1}{2}\]
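A minimal sketch of these two calculations (function names are mine; \(\log\) is the natural log, and the bound requires \(p < 1/e\)):

```python
import math

def bfb(p):
    """Upper bound on the Bayes factor in favor of H1, given a p-value.
    Per the formula above; valid only for 0 < p < 1/e."""
    if not 0 < p < math.exp(-1):
        raise ValueError("bound requires 0 < p < 1/e")
    return 1.0 / (-math.e * p * math.log(p))

def posterior_odds_bound(p, prior_odds=0.5):
    """Posterior odds bound = BFB * prior odds (default prior odds of 1/2)."""
    return bfb(p) * prior_odds
```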
An example of point 3, using wording from Benjamin & Berger:
“the \(p\)-value of \(0.05\) corresponds to the upper bound on the Bayes factor of \(2.44\) which, combined with our prior odds of \(1:2\), implies post-experimental odds of at most \(\frac{2.44}{1}\times \frac{1}{2} = 1.22\) in favor of the alternative hypothesis.”
\(p\)-values are simply probabilities measuring a standardized notion of extremeness.
They do not indicate how precise our estimate is.
Nor do they indicate how useful the conclusions we make are.
A suggestion to alleviate this is to combine:
\(p\)-values to give us an indication of how strong our evidence is against \(H_0\).
Confidence intervals let us determine its precision.
If researchers determine a meaningful effect size prior to the study, we can gauge how meaningful the final results are.
Establish strength of evidence for an effect. Use desired probabilistic (or other) criterion.
Create an interval to establish precision of parameter estimate. Again, use preferred method.
Compare interval to what is considered a meaningful effect.
Use “thoughtful” judgement based on context and statistical information.
Report all information so that others can make their conclusions.
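One illustrative way to mechanize the comparison step: the thresholds and labels below are my own illustration (assuming larger positive effects are desirable), not a rule from the literature.

```python
def classify_result(ci_lo, ci_hi, meaningful):
    """Compare a confidence interval for an effect to a pre-specified
    minimal meaningful effect size (illustrative labels only)."""
    if ci_lo >= meaningful:
        return "evidence of a meaningful effect"
    if ci_hi <= 0:
        return "evidence of no (positive) effect"
    if 0 < ci_lo and ci_hi < meaningful:
        return "evidence of an effect, but not a meaningful one"
    if ci_hi < meaningful:
        # interval straddles 0 but stays below the meaningful threshold
        return "no detectable meaningful effect"
    return "inconclusive: compatible with both no effect and a meaningful one"
```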
A new corn fertilizer is being investigated; a 5cm increase in plant height is considered the minimal meaningful effect.
Here are five potential outcomes of a study on the fertilizer.
One suggestion is to concentrate on the Shannon information of a \(p\)-value, which is referred to as an \(s\)-value (Greenland 2019).
\[s = \log_2\left(\frac{1}{p}\right) = -\log_2(p)\]
This tells us how many bits of information we have against the null hypothesis.
Let \(k\) be the integer closest to \(s\); then \(p\) carries roughly the same level of evidence against \(H_0\) as seeing all heads in \(k\) flips carries against a coin being fair.
\(p = 0.05\) \(\to\) \(s = -\log_2(0.05) \approx 4.3\): the evidence is not much more surprising than seeing 4 heads in a row when tossing a fair coin.
\(p = 0.005\) gives \(s \approx 7.6\): the observed data are a little less surprising than 8 coin flips of heads in a row.
“Would you believe a coin is unfair if you saw \(k\) heads in a row?”
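The conversion is a one-liner; rounding \(s\) to the nearest integer gives the coin-flip analogy:

```python
import math

def s_value(p):
    """Shannon information (in bits) against H0 carried by a p-value."""
    return -math.log2(p)

def coin_flips(p):
    """Nearest-integer number of consecutive heads carrying comparable surprise."""
    return round(s_value(p))
```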
Blume et al. (2019) suggest something they refer to as the “Second-Generation \(p\)-value” (SGPV).
The motivations for it are:
Incorporate the idea of scientifically important findings via defining a meaningful effect size.
Assess the validity of the hypothesis via a measure with same range of values as a \(p\)-value, \(0\) to \(1\), that does not confound sample size and effect size.
Keep the simplicity of an automatic decision rule that maintains a simple relation with the Type I error rate specified by \(\alpha\).
The procedure can be simplified into a few stages.
Specify a range of values for the effect which would be considered not meaningful.
Let \(\theta\) be the parameter/effect of interest:
\[H_0: \theta_L < \theta < \theta_U\]
Anything in between is considered unimportant.
Gather data and compute an interval using your preferred method be it Frequentist, Bayes, or other.
Compute SGPV, \(p_\delta\), the proportion of the intervals that overlap. (Details to follow.)
Figure from Blume et al. (2019):
Denote the width of the computed interval by \(|I|\) and the null hypothesis interval width by \(|H_0|\).
If \(I\) is “precise”, which is defined as \(|I| \le 2|H_0|\):
\[p_\delta = \frac{\text{Overlap of $I$ and $H_0$}}{|I|}\]
If \(I\) is “imprecise”, which is defined as \(|I| > 2|H_0|\):
\[p_\delta = \frac{1}{2} \cdot \frac{\text{Overlap of $I$ and $H_0$}}{|H_0|}\]
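A minimal sketch of the \(p_\delta\) computation for intervals on the real line (argument names are mine):

```python
def sgpv(ci_lo, ci_hi, null_lo, null_hi):
    """Second-generation p-value: fraction of the interval I = (ci_lo, ci_hi)
    overlapping the interval null H0 = (null_lo, null_hi), with the
    correction applied when I is 'imprecise' (|I| > 2|H0|)."""
    width_i = ci_hi - ci_lo
    width_h0 = null_hi - null_lo
    overlap = max(0.0, min(ci_hi, null_hi) - max(ci_lo, null_lo))
    if width_i <= 2 * width_h0:       # 'precise' interval
        return overlap / width_i
    return 0.5 * overlap / width_h0   # 'imprecise' interval
```

A CI entirely inside the null interval gives \(p_\delta = 1\), a disjoint CI gives \(0\), and a very wide CI containing all of \(H_0\) gives \(1/2\).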
The precise versus imprecise distinction prevents misleadingly small values of \(p_\delta\) when the intervals are excessively wide.
Given the definition of \(p_\delta\), potential values range from 0 to 1.
\(p_\delta = 0\) indicates that the data are not compatible with \(H_0\) according to the error criterion you used for \(I\).
\(p_\delta = 1\) indicates that the data are compatible with \(H_0\).
\(p_\delta = 0.5\) indicates the data are completely inconclusive with regard to \(H_0\) compatibility.
Matthews (2019) proposes what they refer to as the Analysis of Credibility (AnCred).
\[\text{Prior insight } + \text{ Data likelihood } \to \text{ Posterior insight}\]
The general procedure is as follows:
Summarize study findings via 95% CI. (Different confidence levels are possible of course.) This is the data likelihood.
Use the CI to compute a Critical Prior Interval (CPI).
This uses the likelihood to deduce the range of prior effect sizes which, when combined with the likelihood, lead to a posterior range that just excludes no effect at the 95% level.
For “significant” results, a Skepticism CPI is calculated.
For “nonsignificant” results, an Advocacy CPI is calculated.
In AnCred there are two situations when results are significant:
There are prior studies to compare to: The findings are deemed credible if prior studies indicate an effect size beyond the Skepticism CPI.
An unprecedented result was found: the findings are deemed credible if the CI point estimate lies beyond the bounds of the Skepticism CPI.
Otherwise results may be “significant” but not credible.
For “nonsignificant” results, a non-zero effect size is determined to be credible if prior studies are within the Advocacy CPI.
The logistics are somewhat complex; for a deeper understanding, please read the paper:
Robert A. J. Matthews (2019) Moving Towards the Post \(p < 0.05\) Era via the Analysis of Credibility, The American Statistician, 73:sup1, 202-212, DOI: 10.1080/00031305.2018.1543136